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Abstract 

Many applications in data analysis rely on the decomposition of a data matrix 
into a low-rank and a sparse component. Existing methods that tackle this task 
use the nuclear norm and f r cost functions as convex relaxations of the rank con- 
straint and the sparsity measure, respectively, or employ thresholding techniques. 
We propose a method that allows for reconstructing and tracking a subspace of 
upper-bounded dimension from incomplete and corrupted observations. It does 
not require any a priori information about the number of outliers. The core of 
our algorithm is an intrinsic Conjugate Gradient method on the set of orthogonal 
projection matrices, the so-called Grassmannian. Non-convex sparsity measures 
are used for outlier detection, which leads to improved performance in terms of 
robustly recovering and tracking the low-rank matrix. In particular, our approach 
can cope with more outliers and with an underlying matrix of higher rank than 
other state-of-the-art methods. 

1 Introduction 

The detection of subspaces that best fit high-dimensional data is a challenging and 
important task in data analysis with countless applications. The classic Principal Com- 
ponent Analysis (PCA) is still the standard tool for searching the best rank-fc approx- 
imation of a high-dimensional data set X € R mx ", where the approximation quality is 
measured in terms of the Frobenius norm. On the one hand, this minimization problem 
can be solved easily via a Singular Value Decomposition (SVD), while on the other 
hand the choice of the Frobenius norm makes this approach highly vulnerable to heavy 
outliers. In the past years a lot of effort has been spent on methods that allow to find 
the best fitting subspace despite the presence of heavy outliers in the measurements or 
missing data. Certain approaches in this area commonly known as Robust PCA stick 
closely to the classic PCA concept and robustify it by measuring the distance from the 
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data points to the subspace with ^i-cost functions, cf. Ding et al (2006}, Kwak ( 2008| l. 
Others, including the approach presented here, relate Robust PCA to the field of robust 
matrix completion. That is, given some incomplete observations X of a corrupted and 
possibly noisy data matrix X, which is the superposition of a low-rank matrix L and a 
sparse matrix S , the task is to recover L and S . 



1.1 Related Work 

One of the most prominent approaches to Robust PCA in the past years is proposed 
by |Candes et al (2009} and Wright et al (2009). The authors formulate a convex op- 
timization problem by relaxing the hard rank constraint to a nuclear norm minimiza- 
tion and analyze how well the l\ -relaxation approximates the Ai-norm in the low-rank 
and sparse decomposition task. Methods to solve the problem include Singular Value 



Thresholding (SVT, |Cai et al| ( [2010} ) and the exact (EALM) or inexact (IALM) aug- 
mented Lagrangian multiplier method (Lin et al 2010 1, which are fast and comparably 
reliable if the underlying assumptions, i.e. low-rank-property and sparsity of outliers, 
are valid. The problem of reconstructing the low-rank matrix L in the case where en- 
tire columns in the measurements are corrupted is considered by |Chen et al ( |201 1 1. A 
method called SpaRCS is proposed by [Waters et al| ( 20TT) , which recovers the low-rank 
and the sparse part of a matrix from compressive measurements. 

In many applications it is reasonable to assume that no additive Gaussian noise is 
present in the model, as noise is negligible compared to the signal power, e.g. when 
dealing with high-quality sensors in the field of Computer Vision. However, there also 
exist approaches such as GoDec (Zhou and Tao 201 1] > that explicitly model additional 
Gaussian noise. The method uses thresholding techniques and extends the low-rank 
and sparse decomposition to the noisy case. SpaRCS and GoDec aim at recovering 
both the low-rank matrix L and the outliers S from noisy measurements and therefore 
require an upper bound for the cardinality of the sparse component. 

In contrast to the often-performed rank-relaxation, there also exist methods which 
fix the dimension of the subspace approximation, such as the greedy algorithm GECO 



( Shale v-Shwartz et al 2011 1. This method reconstructs a dataset iteratively based on 



S VD while increasing the rank of the approximation in each step. A general framework 
for optimizing over matrices of a fixed rank has also been proposed by |Shalit et~al| 
(2010). One drawback here is that the manifold requires a fixed rank k and optimal 
points that may have a rank strictly lower than k are not in the feasible set. 

An alternative and elegant way to control the rank in terms of an upper bound is 
optimization on the set of fixed-dimensional subspaces, the so-called Grassmannian. 
As pointed out by |Meyer et al| ( |2011| ), optimizing on the Grassmannian offers many 
advantages, such as limited memory usage and a lower number of parameters to opti- 
mize. Although the optimization problem becomes non-convex, reliable performance 
can be achieved in practice. In the work of |Keshavan and Montanari (2010), spec 



tral techniques are combined with a learning algorithm on the Grassmannian. Boumal 



|and A bsil (2011 ) propose a Riemannian trust-region method on the Grassmannian for 
low-rank matrix completion, which, however, does not consider heavy outliers. The 
GROUSE algorithm (Balzano et al 2010[ ) furthermore demonstrates that optimization 
on the Grassmannian allows to estimate the underlying subspace incrementally. In- 
stead of batch-processing a data set, the samples can be processed one at a time, which 
makes it possible to track a subspace that varies over time, even if it is incompletely 
observed. A small portion of the data is used to obtain an initial subspace estimate and 
with every new incoming data sample this estimate is modified to follow changes in the 
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dominant subspace over time. While the method of Balzano~et al|(2010)) operates with 



^2-cost functions, its recent adaptation GRASTA (|He et al 201 2} performs a gradient 



descent on the Grassmannian and aims at optimizing an t\ -cost function to mitigate the 
effects of heavy outliers in the subspace tracking stage. The authors overcome the non- 
differentiability of the 1 1 -norm by formulating an augmented Lagrangian optimization 
problem at the cost of doubling the number of unknown parameters. 

1.2 Our Contribution 

Although the ^i-norm leads to favorably conditioned optimization problems it is well- 
known that penalizing with non-convex ^-surrogates allows reconstruction even in the 
case when ^i-based methods fail, see e.g. Chartrand and Staneva (2008). Therefore, 



we propose a framework that combines the advantages of Grassmannian optimization 
with non-convex sparsity measures. Our approach focuses primarily on reconstructing 
and tracking the underlying subspace and can operate on both fully and incompletely 
observed data sets. The algorithm performs a low-rank and sparse decomposition. 
However, it does not require any information about the cardinality or the support set of 
the sparse outliers, thus being different from SpaRCS and GoDec. 



In contrast to GRASTA (He et al 2012), the method presented in this paper directly 



optimizes the cost function and thus operates with less than half the number of un- 
knowns. Like all optimization methods on the Grassmannian our algorithm allows to 
upper-bound the dimension of the underlying subspace and easily extends to the prob- 
lem of robustly tracking this subspace. 

Experimental results confirm that the proposed method can cope with more outliers 
and with an underlying matrix of higher rank than other state-of-the-art methods, thus 
extending possible areas of application from the strict low-rank and sparse decompo- 
sition to the more general area of robust dimensionality reduction. In the following 
section we present an alternating minimization scheme and relate our approach to di- 
mensionality reduction via PCA. Subsequently, we carefully derive and explain a Con- 
jugate Gradient (CG) type algorithm on the Grassmannian for solving the individual 
minimization tasks. We then extend the static method by a dynamic subspace track- 
ing algorithm and finally, we evaluate the performance of the proposed method with 
various ^o-surrogates and compare our approach to other state-of-the-art methods. 



2 Problem statement 

Let X G R mx " be the data matrix of which we select the partial entries X = A(X) using 
a linear operator. We consider the data model X = L + S where L is of rank rk(L) < k 
and S is sparse. Our aim is to recover L from the observations X. A direct approach 
leads to the numerically infeasible optimization problem 

mm\\±- A(L)\\ . (1) 

rk£<* 

Since rk(L) < k, we can write the matrix L as the product L = UY, where Y e R* x " and 
U is an element of the so-called Stiefel manifold S% m = [U e W" xk \U T U = h\ with I k 
denoting the (k x £)-identity matrix. This factorization of L has the following practical 
interpretation: U is an orthonormal basis of the robustly estimated subspace of the first 
k principal components, hereafter referred to as (robust) dominant subspace. Note, 
that neither U nor Y are uniquely determined in this factorization. Let O(k) = {9 e 
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R kxk \6 T 6 = Ik) define the set of orthogonal matrices, then U can always be adjusted by 
an orthogonal matrix 8 e 0(k), such that UY = U69 T Y with T Y having uncorrected 
rows, leading to the k (robustly estimated) principle component scores of X. 

Our concession to the numerical infeasibility of ([TJ is to employ an alternative, 
possibly non-convex sparsity measure h. Some concrete examples can be found in 



Section 6.1 As a result, problem ([TJ relaxes to the minimization problem 



min h(X- A(UY)). (2) 

UeSt kj „,YeR tx " 

Problem Q can be addressed in two different ways. Either the full data set X is pro- 
cessed at once, resulting in one estimate for U and Y, or the data vectors x are processed 
one at a time. In the latter case, while for each new sample x one obtains a correspond- 
ing optimal coordinate set y, the subspace estimate U should vary smoothly over time 
and fit both recent and current observations. We will refer to the former problem as 
subspace reconstruction or Robust PCA and to the latter as subspace tracking. 



3 An alternating minimization framework using sur- 
rogates 

Directly tackling problem Q has two severe drawbacks. The first is the above men- 
tioned ambiguity in the factorization UY and the second is the fact that reasonable 
sparsity measures are not smooth and thus forbid the use of fast smooth optimization 
methods. We overcome these two problems by proposing an alternating minimization 
framework that gets rid of the ambiguity by iterating on a more appropriate geometric 
setting and allows us to use smooth approximations of h, which combine the advantage 
of having fast smooth optimization tools at hand together with the strong capability of 
non-convex sparsity measures in reconstruction tasks. 

So let hp : R mx " — > R. with p e R + be a smooth approximation of h such that h^ 
converges pointwise to h as p tends to zero. Monotonically shrinking the smoothing 
parameter p between alternating minimization steps allows to combine the advantages 
of both smooth optimization and £o-\\ke sparsity measures. Schematically, we tackle 
problem (|2]i by iterating the following two steps until convergence. 
Step #1 Let L w = f/ (!) F w be the z'-th iterate of the minimization process. In the first 
step, the estimate of the dominant subspace is improved by solving 

U ii+l) = arg min h^X - A(UU T L (i) )). (3) 

UeSt kj „ 

We master the ambiguity problem by reformulating (TSJ as a minimization task on the 
set of symmetric rank-£ projectors, which possesses a manifold structure and is known 
as the Grassmannian 

Gr Ml := [P e R mxm |P = UU T , U e St k , m }. (4) 
This results in the optimization problem 

P <J+1) = arg min h M (X - A{PL®)), (5) 

withF (, ' +1) = £/('' +1 >({/('' +1 >) T . 
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Step #2 In the second step, the coordinates with respect to t/ (,+1) of the projected 
data points are adjusted by solving 

= arg min h^X - A(U (i+1) Y)). (6) 

Then, fi is decreased previous to the next iteration. Since ordinary PCA of the data 
allows a reasonably good initialization of U and Y, we initialize our algorithm with 
a truncated SVD of some matrix Xq that is in accordance with the measurements, i.e. 
A(Xq) - X. The alternating scheme is summarized in Algorithm uj 



Algorithm 1 Alternating scheme for Robust PCA 
Initialize: 

Choose X , s.t. A{X ) = X. 

Obtain U (ir> from k left singular values of Xq. 

y(0) = u( 0)t Xq 
jJO) _ {/(0)y(0) 

p(0) _ JJ(P)JJ(0) T 

Choose yu (0) and compute 

for i = 1 : / do 

= arg min /y,> ( x ~ A(PL {i) )) Step #1 
findf/ (/+1) s.t. [/(i+Dc/(i+i)T =jP (i+i) 
y('+D = arg min h^QL - A(U {i+l) Y)) Step #2 

_ fy-(i+l)y(i+l) 

end for 

L = U (I >Y (I \ § = X - L 



4 Optimizing the smooth sparsity measure 

The proposed alternating minimization scheme involves the two non-convex but smooth 
optimization problems Q and (|6j. We minimize both using a Conjugate Gradient type 
method due to its scalability and superlinear rate of convergence. The optimization 
methods are conceptually very similar but differ in the domains they operate on, as 
becomes clear from the cost functions and their respective gradients. 

/, : Gr*, m -> R, Ph h M (X - A(PL)\ V/i = -A^Vh^X - A(PL)j) L T (7) 
f 2 : R kx " -> R, Kh - ^(I/F)), V/ 2 = -f/ T ^*(v/j /J (X - ^(I/T))) (8) 

Note that A* denotes the adjoint of the operator A. Conjugate Gradient methods for 
minimization problems in the Euclidean case, such as §8^ are standard and well estab- 
lished. In contrast to this, the core of our algorithm, i.e. minimizing (|7j is a geometric 
optimization problem on the real Grassmannian and thus requires additional concepts 
such as vector transport and retraction. 
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In the following, we will recall some general concepts and further collect the ingre- 
dients for our algorithm. In particular, we derive a new retraction that is crucial for the 
algorithm's computational performance. For a deeper and more general insight into the 
topic of Geometric Optimization we refer to the work of |Absil et al| ([2008). 



4.1 Geometry of the Grassmannian 

Consider a projector P e Gr^ m as a point on the Grassmannian and let u(m) := {Q e 
R mx '"|Q T = -Q) be the set of skew-symmetric matrices. Using the Lie bracket operator 
[Zi,Z2] = Z1Z2 -Z2Z1, the set of elements in the tangent space of Gr^ >m at P is given by 
T P Gr kM = {[P,£l] I D. 6 u m }. In the following, we will endow R" ,xm with the Frobenius 
inner product (Z^Zj) := tr(Z^Z2) where tr( ) denotes the trace operator, and consider 
the Riemannian metric on Gr^,,, accordingly as the restriction of this product to the 
respective tangent space. The orthogonal projection of an arbitrary point Z s R mxm 
onto the tangent space at P is 

„ p : r ,x '" -> T P Gr M „ Z .-» [P, [P, Z,]] (9) 

with Z s = |(Z + Z T ) being the symmetric part of Z. 



It is crucial for optimization procedures on manifolds to establish a relation be- 
tween elements of the tangent space and corresponding points on the manifold, which 
motivates the usage of retractions. Conceptually, a retraction at some point P is a map- 
ping from the tangent space at P to Gr^ ,„ with a local rigidity condition that preserves 
gradients at P. 

The generic way of locally parameterizing a smooth manifold is via Riemannian 
exponential mappings. As they are costly to compute in general we perform an approx- 
imation based on the (^-decomposition. Let Gl(m) be the set of invertible (m x m)- 
matrices and let lZ(m) c Gl(m) be the set of upper triangular matrices with positive 
entries on the diagonal. It follows from the Gram-Schmidt orthogonalization procedure 
that the (^-decomposition 0(m) x lZ(m) — > Gl(m), (Q, R) i-> QR is a diffeomorphism. 
Accordingly, every A € Gl(m) decomposes uniquely into A -: AqAr with Aq € O(m) 
and Ap e lZ(m). Moreover, it follows that the map 

qa : R -» (Km), q n (t) := (7 m + tO) Q (10) 



is smooth for all O € u(m) with derivative at being qn(0) = £1, cf. Kleinsteuber and 
Hliper| | |2007) . Using 



a' P : TpGikjn -> Gr*, m , Q'^(^) := 9[f,pj(l) P fe,P](l)) T (11) 

and exploiting the properties of the exponential mapping, one can develop the following 
result: 

Lemma 1 Consider arbitrary orthogonal matrices 6 e 0(m). The mapping 

a P , e : 7> Gr*, m -> Gr tm , or^t© := (^^(1)) T P (?r K ,p]e(l)) T T (12) 
defines a set of retractions on Gr^ m . 

Proof 1 The first condition, namely ap t g(0) — P, follows straightforwardly. It remains 
to show that ^| f _ a f,e( f ^) - 4- 
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Firstly, note that q\ t ^p ](i) = qy p](f). Since it has been verified that <?n(0) = £2 
(cf. \Helmke et a l (2007)) and [£, P] is skew symmetric, it follows that 

j t \ t=Q a P ,o(t{) = 0q e r^ P]g (O)0 T P0(q e r^p ]g (O)) T 6 T + 9q g r^ P]g (O)0 T P0(q g r^ P]e (O)) T T 

= 6 T [£, P]0 T p - P0 e T [£ P]0 T 

= [P,[P,&] 
= t 

To understand the last step of the equation, note that ^ e Tp Gvk,m and therefore, £, is 
invariant under the projection 7Tp(-). □ 

As an associated vector transport, i.e. a mapping that for a given £ € Tp Grt m trans- 
ports the tangent element 77 e IpGr^,,, along the retraction Q , p,e(^ r ) to the tangent space 
Tap^) Gr i>m , we choose 

t^pM := (qertfMV) t t] (q e - lm e(l)) T T . (13) 

Note that in our algorithm the context of t is always clear. Thus, we will drop the 
subscripts and simply write T(rf) for enhanced legibility. 

4.2 CG on the Grassmannian 

In the following, we sketch how the well-known nonlinear CG method extends to the 
Grassmannian for minimizing a smooth function /: Gr^,,, — > R. Recall, that if / is the 
restriction of a smooth function /: R mxm — > R, the Riemannian gradient in the tangent 
space is given by 

grad/CP)=MV/), (14) 

where V/ is the common gradient of / in R mxm . The CG method on the Grassmannian 
can be outlined as follows. Starting at an initial point P (0) e Gr^, the Riemannian 
gradient r <0) can be computed and Z/ (0) = -F (0) is selected as initial search direction. 
In each iteration suitable step-size ? (,) is determined using a backtracking line-search 
algorithm on the Grassmannian, cf. Algorithm [2] The new iterate is then obtained via 
= apvifi*!!®). Finally, the search direction 

H (M) = _ r (/+U + 0>) T (H®) (15) 

is updated, where we consider two update rules for/?®, namely 

= <r<' +1 >,r<' +1 >) (0 = <r'' +1 >,(r(' +1 >-T(r<')))> 
p ™ <r®,r©> ' Pm <T(//(')),(rc+i)_ T (r(0)))- 1 ' 

The former is a Riemannian adaption of the well-known Fletcher-Reeves update for- 
mula and guarantees convergence of our algorithm (see the subsequent remark). The 
latter is an adaptation of the Hestenes-Stiefel formula and typically leads to better con- 
vergence behavior in practice, which is why we use it for all our algorithms. 

Algorithm 2 Backtracking line search on Grassmannian 



Choose ti„n > 0; c,p € (0, 1) and set t <— t mil 
repeat 
t <- pt 

until f{a P (tH {i) )) < f(P) + c t rr(T (i)T # (i) ) 
Choose step-size f w := t 



7 



Remark Convergence of the geometric CG method, which uses the retraction and vector 
transport as described above together with the Fletcher-Reeves update /3fr and the 
strong Wolfe-Powell condition, is guaranteed by a result of Ring and Wirth ( 2012| l 
in the sense that liminf^oo ||F W || = 0. Note, that the Wolfe-Powell condition would 
require a slight modification of Algorithm[2] However, we do not stress this issue since 
it has no impact on the practical convergence behavior. 



4.3 Implementation of CG on Grassmannian 

So far, the algorithm outlined above requires full (m x ra)-matrices for the iterates 
pw f j< l > and H®. This is a drastic limitation on the performance of a practical imple- 
mentation. In this section we derive a new retraction and show how it can be used to 
avoid full matrix multiplication and to reduce the storage requirements tremendously. 
The key idea is to decompose the projection matrices P e Gr^ m into P = UU T and to 
iterate on Stiefel matrices U e St^ ,„ instead. Moreover, one can exploit the structure of 
the tangent space 7xGr^ m at the standard projector I, that is 



h 




TiGv, 



k,m 



A T 
A 



As 



(m-k)xk 



Given U, a large (^-decomposition G f U yields a fast way of constructing 

V:=[U | £/ x ] e 0(m), 



(17) 



(18) 



where U J - e St( m _fc) jB) denotes a basis of the orthogonal complement of the subspace 
spanned by U. Then, if P = UU T and £ e Tp Gr^, the identities 



V T PV = 1 and V T £V 



A T 
A 



(19) 



hold for some A 6 R (m * )xi . Therefore, instead of storing the full P and £ it is sufficient 
to store U and A. Formally, this defines the bijection 



con v : T P Gr k>m -» R(«-*** con v (%) = A. 
The projection onto Tp Gr^ m follows from a straightforward calculation. 



(20) 



Lemma 2 Let V = [if | t/^J.P = UU T and Z s = |(Z + Z T ) be defined as above 
and the orthogonal projection onto the tangent space at P be denoted by np. Then the 
identity 



con v (n P (Z)) = (U^) T Z S U 



(21) 



holds. 



Proof 2 Using the definition ofnp(Z) and the fact that V T PV — X, 

V T n P (Z)V = [V T PV , V T [P,Z S ]V] 
= [1, [2, V T Z S V]] 



and the same structure as in ( |19| > is obtained. 
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Lemma [T] allows to choose a particular retraction that is easy to compute. Consider 
A - cony(^) and let 



A = , 



R 





be the large QR -decomposition of A, with 6 A e 0(m - k) and R an upper trian; 
necessarily invertible) (lc x £)-matrix. Furthermore, define 



M(R) :-- 



-R T 
h 



and its g-factor 6 M e 0(2k). 



Lemma 3 Let apgig) be a retraction as in ( |12| i, where is chosen as 6 :— V 
Then the Stiefel matrix 



U :=G 



M 

hn-lk 



e St* 



(22) 
;ular (not 

(23) 



/* 

o e A 



(24) 



satisfies ap^) — UU T . 

Proof 3 One can deduce qgtujj^giX) — ifm + 6 t [^,P]0)q — 









*m—2k 



from 



T [£ PW : 



h 

el 
h o 

B\ 


oja 



[V T £V, V T PV] 



-A T 
A 





[-R T 0] 
^ 



Therefore, the retraction is 



= V 



9m 

I m -2k 



9 T P9 



e ~ll 





h 
A 



0m 

I m -2k 





Im-2k 

h o 
o e; 



o o A 



h 

0A 





4 o 

o 6 A 



9 M 

/, 





m— 2k 



V 



h o 
o e A 



6m 






I m -2k 



[h o] 



91 









o e\ 



which completes the proof. 



Similarly to the retraction, the vector transport from ( fT3] l is simplified, as is described 
in the following. 
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Lemma 4 Let 8, £ P, U be as above and V := | J/ -1 ]. Then for rj e Tp Gr^,,, and 
B :— conv(rf), the identity con^(r^.pg(^)) = 6 T A B holds. 



Proof 4 From ( fT9] l we have V T r]V = 
ten as 



Using the fact that 

B T [u | U^] = 6 T 
we can show that 
V T T^ e (lW = [U | 







. The vector transport can thus be writ- 



0M 




h o 







B T 




"/* o 




°M 





h 




o e\_ 




B 







6 A 
















Im-2k 









*m—2k 



o 

*m—k 









^m-2k 



B T 






Im-2k 

h o 

6 A 



B T 










Im-2k 



6 T [U | U x ] 



ejB o 

and cony(T(pg(rj)) follows again from the matrix structure. □ 

Algorithm [3] illustrates how the minimization of (|2]i on the Grassmannian is effi- 
ciently implemented in practice. Note, that therein, H and G denote the preimages of 
the respective bijection cony and thus are of dimension (to - k) x k. We do not fur- 
ther discuss the CG method in the Euclidean space as it is a standard method. In all 
minimization procedures, the convergence is evaluated by observing the progress in 
decreasing the cost function. 



Algorithm 3 Implementation of CG on Gr^„ 
input U = U® 

Obtain V and U x from Eq. ([T8j 
Compute G = {U x ) T Vf{UU T )U 
H = -G 
repeat 

Obtain 9 H , R from H as in Eq. ( |22| > 
Determine step-size t acc. Algorithm[2] 
Obtain M from gfl-dec. of M(tR) <[23]» 
Update U according to |24) 
Update V, U x , G as above 
Compute t(G) = 9 T H G and t(H) = 9 T H H 
Update H following ( fT5] l and ([16) 
until converged 

output t/ (i+1) 



10 



5 Robust subspace tracking 



In this section, we extend the aforementioned mathematical tools to robustly track the 
underlying subspace. To that end, we require our sparsity measure h to be separable, 
meaning that the sparsity measure of a matrix A = \a\,az, . . . , a„] consists of the sum 
of sparsity measures of its columns a,-. Note, that all sparsity measures given in Eq. 
( |34l > fulfill that condition. By slight abuse of notation, we write 

h([A | a] ) = h(A) + h(a), (25) 

where a e R m is the last column of the matrix |A | a]. Assume now that the current 
observation matrix a® is updated by a new observation vector leading to a new 
observation matrix = [x | ij. Following (|2]), the new optimization problem is 

min h(\x I Jcl-^(f/[F I y])). (26) 

U€St kjn ,YeB. ha >,yeR k v L J v L J " 

We show how the above problem can profit from knowledge of the optimal subspace 
in the previous iteration. Due to the separability of h and the properties of A we can 
separate the current observation from the previous optimization problem by rewriting 

h[[t | x]-^(f/[F | y])) = h(X-A(UY)) + h(x-A(Uy)). (27) 

Furthermore, we introduce a weighting factor < w < 1 for incoming observations, 
which will play the role of a forgetting factor in the actual tracking algorithm. The 
optimization problem for subspace tracking now reads as 

min (1 - w)h(X - A(UY)) + wh(x - A(Uy)). (28) 

UeSt^ m ,YeR tx ",yeK. k 

Assume P w and L w = f/ w 3 /(,) that minimize h(X - A{L)) and an initial estimate Iq = 
P^Xq with A{xq) = are available for the current data vector. Then a two-step 

update rule similar to |5]) and |6]l can be formulated. In a first step, update the dominant 
subspace estimate 

P iM) * arg min (1 - w)h{X - A(PL {i) )) + wh(5c - A(Pl )) (29) 

via gradient descent on Gr^ m and decompose = [/( 1+i )[/('+i)t Note that the 

gradient F (,+I) can be updated from the previous F w and the current observation. Then 
adjust the coordinates of the current low-rank estimate by optimizing 

y (i+l) = arg min (1 - w)h(x - A(U (i+1) y)). (30) 



While ([30} can be solved in the same fashion as before by employing a CG method 
in Euclidean space, the solution of problem ( f2"9j ) is approximated by one gradient de- 
scent step on the Grassmannian, i.e. 

= a P <»(pr (i+l) ) = apafal - w)F w + w r (i+1) )). (31) 
In this formula, /3 is a suitable step size from standard gradient descent methods, 

r (i) = gradph(X - A{PL (i) )) (32) 
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is the Riemannian gradient at the preceding iteration, and 

y (,+1) = gradpHx - A(Pl (i+1) )) (33) 

the Riemannian gradient for the current observation. The tracking scheme is illustrated 
inAlg.|4] 

Algorithm 4 Subspace tracking 
Initialize: 

Select initial data set consisting of at least k data vectors. 

Estimate t/ (0) and F (0) following the alternating minimization outlined in Alg. HI 

jJO) _ {/(0)y(0) 
p(0) _ U (0) U (0)T 

Compute r (0) (X , P <0) , L (()) ) according to (|32j 
for each new sample x = x (l+l} do 

Choose w and fi as well as Xq, s.t. A{xo) = x. 

Initialize y = U® T x and l = U (i) y 

Compute r ( '' +I) (jc , P®, l ) according to ([33]) 

Update r (;+1) = (1 - w)r w + wy (i+l) 

Update = a P (o(y6r (, ' +1) ) cf. ((29), ((31} 

findf/ (i+1) s.t. u m) U (i+l)T = P (M) 
y< i+1 ) = arg min ^(Jc - cf. ((30j 

fi+l) _ j/ti+l) («+l) 

end for 



To further reduce computational cost, the following lemma proves useful. 

Lemma 5 Define V = [C7 | f/- 1 ] ant/ f/ie Riemannian Gradients T and y as above 
and denote by G — (U x ) T rU and g = (U ± ) r yU their respective preimages according 



to ( |20[ ). Then the identity 

cony((l - w)T + wy) = (1 - w)G + 

The proof follows from the linearity of (|20(l. 



Using this result, the computational effort reduces tremendously, because explicit 
computation of and apmffiT^'^) is not required for the new iterate £/ (,+1 \ Consid- 
ering |7]i for the case of a single data vector, y and with it g become rank-one matrices. 
Thus, it is sufficient to compute G and its (^-decomposition i n the initialization phase 
and then to perform lightweight rank-one updates (cf. Golub and Van Loan ( 1996| l) 



on the Q and R factors to update G and U in each iteration. The detailed procedure is 
illustrated in Algorithm [5] 
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Algorithm 5 Implementation of the Gradient descent update 



input xq, l , P®, V = [u® | U 1 - ®\ 0® and R® from the ^^-decomposition of G {i) 
as in ( p2) 

Compute y(xo, P w , Zo) an d g = cony(y) 
Find u, v s.t. uv T = g 

Obtain andfl (,+1) via rank-one-update from 0^ , R (l> , u and v 
[4 1 

Compute 6 = V 

G 

Determine optimum step-size t from line-search along H = -G (!+1 ^ 
Obtain 6» M from gfl-decomposition of M(tR (i+1) ) @ 
Update U according to ( p4j > 

output U (i+1 \^ +1) ,R^ 



6 Experiments and Evaluation 

This section gives an overview of the actual implementation and the performance of 
the presented algorithms. Firstly, we refer to practical issues such as the selection of 
suitable penalty functions and the choice of parameters. In the following, we eval- 
uate the performance of our approach in the Robust PCA task and demonstrate how 
the proposed method outperforms other state-of-the art algorithms. Concluding the 
experiments, we illustrate the method's behavior on real-world data by tracking a low- 
dimensional subspace in a visual background reconstruction task. 



6.1 Smooth penalty functions 



Inspired by the work of Gasso et al (2009 ), we investigate the following smoothed 
^-surrogates: 

n m £ 

h \p . R mx« x ^ y y ^2 +fx y > o < p < 1 (lpnorm) (34a) 

li° g : R mx " -> R + , X ^ J] Yj log ( 1 + ~m ) ( lo SMi1hm) ( 34b ) 

/=! i=l 
n m 

/? atan . R mx» ^ atan 2 ^ (-34^ 

7=1 ''=1 

Using this kind of cost function instead of e.g. the ^i-norm has several advantages. 
The smoothing parameter fj. can be tuned in a way that the cost function either penal- 
izes large outliers (similar to the Frobenius norm) or enforces sparsity on the outliers 
regardless of their magnitude. Within a practical implementation it is observed that 
larger values of /i lead to faster convergence of the optimization algorithm while small 
values, as expected, lead to much sparser residuals. We profit from this flexibility by 
adjusting the smoothing parameter and thus the resulting sparsity. Specifically, in the 
subspace reconstruction task we prefer faster convergence in the beginning in order to 
quickly obtain a reliable rough estimate of the subspace and reduce fi after each alter- 
nation. In the limit /j — > the cost functions behave similarly as the ^o-norm, which 
leads to the best results, as the following experiments demonstrate. 
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6.2 Numerical experiments for the subspace reconstruction task 



In order to have a quantitative measure for the recovery performance of the robust PCA 
algorithms, we conducted numerical experiments with artificial test data and employed 
the available ground truth for a quantitative evaluation. We compare our algorithm to 
SpaRCS (jWaters et al| |20TT]l, RPCA (EAL M and IALM) ( |Lin et al| [2QT0] >, GRASTA 
(He et al |2012| l and GoDec (Zhou and Tao 201 1 1, and use MATLAB implementations 
provided by the authors. In all experiments we constructed the test data X e R" ,x " to 
be the sum of a matrix L of fixed rank k (with k < m) and a sparse matrix S . The low- 
rank component is obtained by computing the singular value decomposition U"LV T of 
a zero-mean, Af(0, l)-distributed random matrix, assigning zero to all singular values 
cr,, i = k+l ...m and reconstructing the matrix F = UYV T . In order to control the mag- 
nitude of the entries for varying k we scale the entries of F to a unit sample standard 
deviation, obtaining a normalized L = ^pjF. The entries of S are randomly placed 
and uniformly distributed in [-5, 5], which makes a well-proportioned relation between 
data point s and outliers. Some other comparisons follow the data model of Candes 
(2009), where the low -rank matrix is the product of two Af(Q, ^-distributed ran- 



et al 



dom matrices. The entries of S follow a Bernoulli distribution and thus surpass the 
amplitude of the data points by several orders. In our evaluation the outliers cannot be 
detected just from their magnitude, but they are still large enough to severely distort 
the estimated subspace. 
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(h) GoDec 



Figure 1 : Subspace reconstruction performance measured by phase transitions in rank 
and sparsity (dark area: reconstruction failed). Our method (a)-(c) achieves the highest 
reconstruction rate. 



A widely used performance measure for subspace reconstruction are the phase tran- 
sitions in rank and sparsity. A data set of dimension m — n — 400 is generated and the 
subspace recovery is evaluated while varying both the relative rank k/m and the relative 
sparsity p$ - ||S||o/wn in the range of 0.025 to 0.5. The reconstruction is considered 

abortive if > 0.05. 

\\l\\f 

The results in Figure [11 demonstrate that the proposed framework (a)-(c) using £q- 
surrogates covers a broad range of scenarios and surpasses SpaRCS, GRASTA and the 
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RPCA methods in the subspace reconstruction task. Especially in the cases of either 
k/m being very small and p$ being very large or vice versa, our algorithm is still able 
to recover the subspace while other methods fail. Out of all compared methods, the 
GoDec algorithm comes closest to our performance - however, it has to be stressed that 
we must feed the exact cardinality of S into the GoDec algorithm, which is not available 
in a real-world application. We chose the number of iterations between the alternating 
minimization steps as / = 50 for all phase transition plots. We choose p = 0.5 and 
shrink p from 0.9 to 10~ 4 for l |34a) , from 2 to 0.005 for ( |34b| i and, respectively, from 2 
to 0.05 for p4cl 



6.3 Incomplete observations 

In a second experiment we evaluate our algorithm for the case of data being incom- 
pletely observed and compare the performance against SpaRCS, which is especially 
designed for compressive measurements. The results in Figure [2] illustrate that our 
method recovers a broader range of configurations than the competing SpaRCS, which 
has its strengths especially for very low-rank matrices. In general, with less samples 
being available, the lower the dimension of the underlying subspace and the sparser 
the outliers have to be for a successful reconstruction. We stress that (i) the outliers 
are placed at random positions and (ii) the data is subsampled in a completely random 
way. 
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Figure 2: Phase transitions at recovering a low-rank matrix from incomplete observa- 
tions 



6.4 Noisy reconstruction 

To investigate the behavior of the proposed algorithm in the presence of additive Gaus- 
sian noise we perform the following experiments: p and k/m are fixed to 0.1 and a 
Gaussian noise term N e R mx " f a particular energy level is added to the data. We run 
our alternating minimization method for 10 iterations and compare the performance 
and the runtime against the above mentioned competing algorithms. Intuitively, a low- 
rank and sparse decomposition on a noisy dataset should mostly affect the sparse com- 
ponent, as Gaussian noise is full rank and only a small amount of noise will affect 
the low -rank component. Figure [5] illustrates the recovery precision for the low-rank 
matrix, where the SNR measures the relation between the energy of L and N. As the 
results reveal, the proposed rank-controlling method provides a rather noise-robust es- 
timation of the subspace, which supports our choice of a simple model. Like all other 
methods, the reconstruction quality deteriorates with increasing noise level. However, 
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at a moderate noise level our algorithm performs equally well as GoDec, which in con- 
trast to our method requires additional knowledge about the cardinality of the sparse 
component. 

Concerning the average runtime for solving the noisy decomposition task on a 
desktop computer in MATLAB, GoDec performs fastest in 1.5 seconds and the IALM 
method requires about 2.7 seconds. Our framework needs an average runtime between 
4.2 (atari) and 5.3 (Ipnorm) seconds to solve the task and outperforms SpaRCS, which 
requires 6.1 seconds. Lastly, EALM and GRASTA are rather costly with 30 and, respec- 
tively, 111 seconds. 

6.5 Subspace tracking on a real-world example 




(a) original x (b) l(n = 2) (c) s (/J = 2) (d) I (p = 0.01) (e) s (jU = 0.01) 



Figure 4: Subspace tracking - Background and foreground for different smoothing 
parameters 



For the purpose of robustly tracking the underlying subspace, we selected the 
dataset lobby from Li et al (2004) and perform the widely popular background sub- 
traction task. In this task, the background of a pixel-wisely sampled video is modeled 
by a low-rank approximation and foreground objects can automatically be extracted as 
they are assumed to be sparse. The selected scene shows an office scenario with peo- 
ple occasionally walking through the scene. It is especially challenging as the lighting 
conditions change significantly after about 400 frames. We initialize our tracking al- 
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gorithm with the first 50 frames of the sequence using the proposed alternating min- 
imization method for 10 steps and perform our tracking algorithm on the subsequent 
frames. From our ^-surrogates we choose ( |34c| > and upper-bound the desired rank k 
by 2. To demonstrate the influence of the smoothing parameter we fix /u — 2 for one 
experiment and for the other we shrink yi from 2 to 0.01 in the initialization phase and 
leave it for the following tracking procedure. The choice of a suitable weighting param- 
eter is a trade-off between reaction time (i.e. how long it takes to adapt the subspace) 
and the risk of overfitting the subspace (i.e. / m x), which leads to ghost images in the 
reconstruction. For all experiments we select w = 0.05 as a weighting factor. 

The sample observation of frame # 365 in Figure [4] illustrates that in both config- 
urations the subspace / is successfully recovered, as the sparse foreground estimates 
s = x — I Qc) and (e)) contain only the moving person in the scene. However, the 
smaller jj. is selected, the less blurry the extracted silhouette, as becomes clear from 
the comparison between the two experiments. The more sparsity on the residual is en- 
forced, the more effectively can ghost images from previous observations be suppressed 
in the extracted foreground objects. 




(a) frame #425 (b) frame #437 (c) frame #450 (d) frame #462 (e) frame #475 



Figure 5: Tracking a change in room illumination. Upper: original video, lower: low- 
rank approximation (/i = 0.01) 

Figure |5]illustrates the behavior of the algorithm when a change occurs in the back- 
ground. Although the lighting conditions following frame # 425 have not been ob- 
served during the initialization, our tracking algorithm adjusts the subspace to the new 
background within a small number of frames. 

7 Conclusion 

We present a framework for Robust PCA that is able to recover a low-rank matrix from 
a data set corrupted by sparse outliers and missing data. Experiments show that our 
method covers a wider range of scenarios in terms of higher rank and greater number of 
outliers than other state of the art methods. Even if the data is incompletely observed a 
comparably good performance is achieved. The same holds true for data sets corrupted 
by additive Gaussian noise. 

Instead of tackling a convex relaxation of the problem, our algorithm reaches as 
closely as possible to the ideal Robust PCA objective by modelling the problem as 
simply as possible and making no additional assumptions, e.g. on the number of sparse 
outliers. Instead of the widely-popular l\ -measure we propose using ^-surrogates, 
which can be adjusted so that the optimization is performed rather smoothly or, in the 
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limit, sparsity is enforced in an to-like behavior, which leads to superior performance. 
In order to obtain an inherent upper bound on the dimension of the dominant subspace 
we propose an optimization problem on the Grassmannian and derive an efficient re- 
traction that saves time and storage and makes the algorithm applicable for large-scale 
problems. Furthermore we show how the method can be adapted to allow tracking a 
subspace that varies over time and use real-world data to demonstrate the performance. 
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